-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: be able to read Stata files without reading them fully into memory #48922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7bc15e8
to
1532991
Compare
There seem to be some failing tests on windows:
|
Right, e.g.
looks like not all handles are being correctly, which is apparently a test suite bug, fixed in 99d3540 :) |
99d3540
to
121ada5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment, otherwise looks good to me!
LGTM cc @bashtage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test changes indicate that this might be introducing a behavior change with respect to leaving handles open in existing code. Can you verif y this isn't the case and restore the other tests that should work unmodified?
pandas/tests/io/test_stata.py
Outdated
@@ -2101,9 +2127,9 @@ def test_non_categorical_value_label_name_conversion(): | |||
with tm.assert_produces_warning(InvalidColumnName): | |||
data.to_stata(path, value_labels=value_labels) | |||
|
|||
reader = StataReader(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like all of this style of change. StataReader
is self-closing. Hsa this changed? If so, that is introducing a new bug to user code that needs to be addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bashtage StataReader is not self-closing. It has had a close
method since 59dd18b (July 2015), to, quote, "ensure closing of the path". (At that point, when you'd pass in a string-like to the ctor, it would open it as a regular file and not buffer things into memory.) You can see 59dd18b also changes some tests to use with
.
Some tests either didn't use that, or didn't call close()
, which in turn caused cleanup issues on Windows now that we don't buffer everything into memory (which was a regression in 6d1541e, 2020).
Any user code that had not used StataReader()
as a context manager had technically been in violation of the protocol established in 59dd18b, it just wouldn't surface since Pandas used to read everything to memory for the last two years.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a strong preference for there to be no test changes aside from those essential for this PR, so please revert these. It is fine to do a follow-up PR to clean the test code to use best practices. This just ensures that there are no visible side effects of other changes, e.g., leaking handles with existing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaked handles should trigger a CI failure, and on Windows, open handles usually result in a failure to delete the file which is a test failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaked handles did CI fail, hence this separate commit to fix those up. See earlier discussion:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior arose from a human mistake in the refactoring in 6d1541e.
You call it a mistake, I call it a coding-efficient choice that was not memory efficient.
I don't think it's first and foremost an optimization as the current behavior entirely prevents using chunked or iterator reading on machines with less available memory than the stata file's size on disk.
The usual response is to get more memory. I think extending the reading to lower memory machines is an enhancement.
I don't think that's true. The test suite changes (where a block already ending with .close() wasn't changed to a with) are:
Most of these self-close. There is only one that would need an explicit call to close if you implement this change only for the case where an iterator is used. (iterator == True or chunksize not None).
Reading a data label and variable labels only, no iterator mode:
It is self-closing in main, so no leaks.
Chunked Stata reader left open:
It is not. Closed since iterator is exhausted.
Chunked Stata reader left open.
Yes. This one could be fixed in the test.
Only value labels being read, no iterator mode (this occurs three times with minor variations):
Should work without modification of the tests since not an iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To move this forwards, why not try:
- implement it only in the case where an iterator is indicated
- Restore the test suite so we can see the failures
- Leave in the new tests, but make adapt them so that they are using an iterator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You call it a mistake, I call it a coding-efficient choice that was not memory efficient.
Please recall the reading code before that inadvertent change was memory-efficient.
Lines 939 to 948 in 59dd18b
if isinstance(path_or_buf, (str, compat.text_type, bytes)): | |
self.path_or_buf = open(path_or_buf, 'rb') | |
else: | |
# Copy to BytesIO, and ensure no encoding | |
contents = path_or_buf.read() | |
try: | |
contents = contents.encode(self._default_encoding) | |
except: | |
pass | |
self.path_or_buf = BytesIO(contents) |
The usual response is to get more memory.
Which is quite user-hostile when in this case the situation can be fixed with a +61/-26 line diff. (On that note, the person who had the issue with a 30-some-gigabyte Stata file on Stack Overflow reached out to me on Twitter for help, and I suggested trying to patch their Pandas' stata.py
with the one from this branch. I think that's friendlier and quite less expensive (monetarily, ecologically, etc.) than to get more memory.)
I think extending the reading to lower memory machines is an enhancement.
See above. Restoring being able to read large files on lower memory machines is a regression fix or a bug fix.
It is self-closing in main, so no leaks.
On main
, StataReader
buffers to a BytesIO
and immediately closes the original handle in the constructor. If that's your definition of "self-closing", then I'm not sure how to interpret your other arguments, since all use of StataReader
on main
(or rather, since 6d1541e) is "self-closing".
It is not. Closed since iterator is exhausted.
Which is technically a bug in itself, see #48922 (comment)
Yes. This one could be fixed in the test.
I'm not following you – why can this one be fixed in tests?
Should work without modification of the tests since not an iterator.
It only works on main
without a leak because StataReader
on main
always buffers into memory, and an unclosed BytesIO
does not raise a resource warning. It wouldn't work on any version prior to 6d1541e without a leak occurring. All of these three were added in fd151ba when the inadvertent buffering code was already in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On main, StataReader buffers to a BytesIO and immediately closes the original handle in the constructor. If that's your definition of "self-closing", then I'm not sure how to interpret your other arguments, since all use of StataReader on main (or rather, since 6d1541e) is "self-closing".
The is exactly my point. The current behavior of StataReader
is to never leak file handles. It has been this way for 2 years now. This is why the test suite passes despite some questionable code. Ideally, this behavior should not be changed without a deprecation cycle. It is probably necessary to change it to get the iterator to work without buffering the entire file, so I think it is acceptable to make this change in this limited case. I would also think the docs for iterator and chunksize should be strengthened to tell users that they must use a context manager or close the file themselves.
I do not think it is OK to change it where it isn't essential without letting users know of a change that could lead to errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be best to focus the fix on #48700 only and not expand the fix to apply to other cases.
pandas/tests/io/test_stata.py
Outdated
block = block.set_index("index") | ||
assert "cats" in block | ||
tm.assert_series_equal(block.cats, df.cats.iloc[2 * i : 2 * (i + 1)]) | ||
with StataReader(path, chunksize=2, order_categoricals=False) as reader: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The iterator mode is already self-closing. Why is the context manager introduced here?
Lines 1704 to 1710 in 2402abe
if read_len <= 0: | |
# Iterator has finished, should never be here unless | |
# we are reading the file incrementally | |
if convert_categoricals: | |
self._read_value_labels() | |
self.close() | |
raise StopIteration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, in this test we're reading the reader through, but if we weren't, then it would be good form to close the reader (and the underlying file handle), just like with any file handles in Python.
In fact, you might argue the self.close()
there in the reader is inappropriate, since an esoteric user could have e.g. saved the reader's underlying file handle's .seek()
position at some point, would then rewind the handle and have the reader read some more... but perhaps that's an esoteric enough use case that it doesn't need to be covered. 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please separate essential test file changes from those that should continue to work without issue after the StataReader changes.
pandas/tests/io/test_stata.py
Outdated
@@ -2101,9 +2127,9 @@ def test_non_categorical_value_label_name_conversion(): | |||
with tm.assert_produces_warning(InvalidColumnName): | |||
data.to_stata(path, value_labels=value_labels) | |||
|
|||
reader = StataReader(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaked handles should trigger a CI failure, and on Windows, open handles usually result in a failure to delete the file which is a test failure.
As for a way forward, maybe could consider adding a keyword argument If If using this strategy, then the current test suite would still pass unmodified. I'm always -1 for adding to the API though. |
@akx pretty much all of the maintainers are going to defer to bashtage on this |
Yeah no worries @jbrockmendel, we're continuing in #49228 (the v2 of this) :) |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Closing in favor of #49228 |
* CLN: StataReader: refactor repeated struct.unpack/read calls to helpers * CLN: StataReader: replace string concatenations with f-strings * CLN: StataReader: prefix internal state with underscore * FIX: StataReader: defer opening file to when data is required * FIX: StataReader: don't buffer entire file into memory unless necessary Refs #48922 * DOC: Note that StataReaders are context managers * FIX: StataReader: don't close stream implicitly * Apply review changes
Fixes #48700
Refs #9245
Refs #37639
Regressed in 6d1541e
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.